During my undergraduate coursework at Harvard, I have studied data visualization in R, focusing on techniques to present data clearly and effectively. As a brief demonstration of my skills, this portfolio includes three examples of my work:
The first visualization explores the relationship between a movie’s critical acclaim, measured by its Metascore, and its U.S. box office performance.
The second visualization showcases the geographic distribution of median household income across the United States.
The third visualization highlights global birth rate trends over time.
Visualization 1
Show the code
library(ggplot2)library(plotly)library(dplyr)library(readr)library(tidyverse)library(here)# movie datamovies <-read_csv(file =here("movies.csv"))# define thresholdshigh_metascore <-75high_box_office <-500# the static ggplotplot <- movies |>ggplot(aes(x = metascore, y = box_office, text =paste("Title:", title, "<br>Metascore:", metascore, "<br>Box Office: $", box_office, "M"))) +# high-performance region - move this below points by adding it firstgeom_rect(aes(xmin = high_metascore, xmax =100,ymin = high_box_office, ymax =max(movies$box_office, na.rm =TRUE) +50),inherit.aes =FALSE,fill ="#f7cac9", alpha =0.2, color ="#c94c4c" ) +# static data pointsgeom_point(color ="#80ced6", alpha =0.7, size =1.2) +# high performing pointsgeom_point(data =subset(movies, metascore > high_metascore & box_office > high_box_office),color ="#f7786b", size =1.5) +labs(title ="Critic Scores and USA Box Office Returns",subtitle ="Highlighting movies with Metascore > 75 and Box Office > $500M",x ="Metascore (Critic Score)",y ="Box Office Returns (Millions of USD)",caption ="Data represents movies released between 2000 and 2021 in English, with at least $500,000 in box office earnings." ) +# themetheme_minimal(base_size =40) +theme(plot.title =element_text(face ="bold", size =15, color ="#2F4F4F", hjust =0.5),plot.subtitle =element_text(size =10, color ="#556B2F", margin =margin(b =20), hjust =0.5),axis.title =element_text(face ="bold", size =10),axis.text =element_text(size =10, color ="#2F4F4F"),panel.grid.major =element_line(color ="#E0E0E0", size =0.5),panel.grid.minor =element_blank(),plot.margin =margin(t =20, r =25, b =20, l =25), # Increased marginsplot.caption =element_text(size =10, hjust =0, margin =10) )# make it interactiveinteractive_plot <-ggplotly(plot, tooltip ="text") %>%layout(title =list(text =paste0("<b>Critic Scores and USA Box Office Returns</b><br>","<sup>Highlighting movies with Metascore > 75 and Box Office > $500M</sup>" ),font =list(size =15, color ="#2F4F4F"),x =0.5,xanchor ="center" ),annotations =list(list(text ="Data represents movies released between 2000 and 2021 in English, with at least $500,000 in box office earnings.",x =0.5,y =-0.1,showarrow =FALSE,xref ="paper",yref ="paper",font =list(size =8, color ="#2F4F4F"),align ="center" ) ) )interactive_plot
Show the code
library(DT)library(crosstalk)library(ggplot2)library(plotly)# share table data witht the plotshared_data <- SharedData$new(movies)interactive_plot <- shared_data |>ggplot(aes(x = metascore, y = box_office, text =paste("Title:", title, "<br>Metascore:", metascore, "<br>Box Office: $", box_office, "M"))) +# high performance areageom_rect(aes(xmin = high_metascore, xmax =100,ymin = high_box_office, ymax =max(movies$box_office, na.rm =TRUE) +50),inherit.aes =FALSE,fill ="#f7cac9", alpha =0.2, color ="#c94c4c" ) +geom_point(color ="#80ced6", alpha =0.7, size =1.2) +geom_point(data =subset(movies, metascore > high_metascore & box_office > high_box_office),color ="#f7786b", size =1.5) +labs(title ="Critic Scores and USA Box Office Returns",subtitle ="Highlighting movies with Metascore > 75 and Box Office > $500M",x ="Metascore (Critic Score)",y ="Box Office Returns (Millions of USD)",caption ="Data represents movies released between 2000 and 2021 in English, with at least $500,000 in box office earnings." ) +theme_minimal(base_size =40) +theme(plot.title =element_text(face ="bold", size =10, color ="#2F4F4F", hjust =0.5),axis.title =element_text(face ="bold", size =8), # Axis label sizeaxis.text =element_text(size =8, color ="#2F4F4F"), # Axis tick sizepanel.grid.major =element_line(color ="#E0E0E0", size =0.5), # Subtle grid linespanel.grid.minor =element_blank(), # Remove minor grid linesplot.margin =margin(t =20, r =25, b =20, l =25), # Margins )interactive_plot <-ggplotly(interactive_plot, tooltip ="text") %>%highlight("plotly_selected")# interactive tableinteractive_table <-datatable( shared_data,extensions =c("Scroller"),options =list(scrollY =300, paging =FALSE),rownames =FALSE)bscols(widths =c(8, 4), interactive_plot, interactive_table)
The relationship between a movie’s critical acclaim and box office performance is crucial, as filmmakers often choose between prioritizing praise or profitability. This visualization plots Metascore ratings against U.S. box office returns to explore their correlation.
The shaded rectangle highlights high-performing movies with both high Metascores and box office revenues, many of which are commercial blockbusters. Hovering over data points reveals details such as the movie’s title, Metascore, and revenue. An interactive table linked to the plot allows users to search for specific movies, highlighting corresponding points on the plot.
Higher Metascores do not consistently lead to higher box office earnings. Many outliers exist, and movies scoring in the 70-85 Metascore range are often more profitable. This range might represent a balance point, suggesting that these movies manage to appeal to both critics and mass audiences. Movies with exceptionally high Metascores may cater more to critics and niche audiences. Striking a balance between critical praise and audience enjoyment could therefore be key to maximizing commercial success.
Visualization 2
Show the code
library(here)library(socviz)library(tidyverse)library(viridisLite)library(plotly)library(maps)library(leaflet)# theme maptheme_map <-function(base_size =13, base_family ="") {theme_minimal(base_size = base_size, base_family = base_family) %+replace%theme(axis.line =element_blank(),axis.text =element_blank(),axis.ticks =element_blank(),axis.title =element_blank(),panel.grid =element_blank(),panel.background =element_blank(),panel.spacing =unit(0, "lines"),legend.justification =c(0, 0),legend.position =c(0, 0),legend.key.height =unit(0.3, "cm"),legend.key.width =unit(0.8, "cm"),legend.text =element_text(size =8),legend.title =element_text(size =10, face ="bold") )}# load datadata("county_map")data("county_data")# joincounty_map_data =left_join(county_map, county_data, by ="id")# static mapstatic_map <-ggplot(data = county_map_data, mapping =aes(x = long, y = lat, group = group, fill = hh_income,text =paste("County:", name, "<br>Median Income:", scales::dollar(hh_income)) # Add text for tooltip )) +geom_polygon(color ="darkgrey", linewidth =0.1) +# Map polygons with borderscoord_fixed() +# Preserve aspect ratioscale_fill_gradient(name ="Median Household Income\n(2009-2013)",low ="#e0f7fa",high ="#006064",na.value ="grey", # color gradientlabels = scales::label_dollar(scale =1e-3, suffix ="k") ) +labs(title ="County-Level Median Household Income in the U.S.",subtitle ="Data from 2009-2013",caption ="Source: US Census Bureau, Centers for Disease Control" ) +theme_map(base_size =14)# interactive mapinteractive_map <-ggplotly(static_map, tooltip ="text") |>layout(autosize =TRUE,title =list(text ="<b>County-Level Median Household Income in the U.S.</b><br><sup>Data from 2009-2013</sup>",font =list(size =16),x =0.5,xanchor ="center" ),margin =list(l =10, r =10, t =50, b =50) )# display the interactive mapinteractive_map
Income inequality is a continuing unresolved issue in the U.S., and understanding its geographic distribution is essential to address disparities. This map visualizes county-level median household income across the United States from 2009 to 2013. Counties are shaded according to their median income levels, with darker shades representing higher incomes and lighter shades indicating lower incomes. The legend on the right helps readers interpret the data.
One striking feature is the concentration of wealth in urban areas, particularly along the East and West Coasts. Counties near metropolitan regions such as the Bay Area and D.C., stand out with significantly higher incomes. Conversely, rural counties in the South and Midwest often display lighter shades. This visualization highlights the urban-rural income divide, emphasizing the need to target economic growth in underprivileged regions.
# birth rate data from WDIbirth_rate_data <-WDI(indicator ="SP.DYN.CBRT.IN",start =1980,end =2021,extra =TRUE)birth_rate_clean <- birth_rate_data %>%rename(birth_rate = SP.DYN.CBRT.IN,continent = region ) %>%filter(!is.na(birth_rate) &!is.na(continent) &# remove rows with NA, delete 3 regions!continent %in%c("Aggregates", "") )# calculate global and regional averagesbirth_rate_summary <- birth_rate_clean %>%group_by(continent, year) %>%summarise(avg_birth_rate =mean(birth_rate), .groups ="drop")# calculate global averageglobal_avg <- birth_rate_clean %>%group_by(year) %>%summarise(global_birth_rate =mean(birth_rate), .groups ="drop")# merge the newly calculated global value togetherbirth_rate_summary <- birth_rate_summary %>%left_join(global_avg, by ="year")# define the lable colorcontinent_colors <-c("East Asia & Pacific"="#E41A1C","Europe & Central Asia"="#377EB8","Latin America & Caribbean"="#4DAF4A","Middle East & North Africa"="#984EA3","North America"="#FF7F00","South Asia"="#FFFF33","Sub-Saharan Africa"="#A65628","Global Average"="black")
Show the code
# creating animationanimated_plot <-ggplot(birth_rate_summary, aes(x = year)) +geom_line(aes(y = avg_birth_rate, color = continent), size =1) +geom_line(aes(y = global_birth_rate, color ="Global Average"), size =1.5, linetype ="dashed") +geom_point(aes(y = avg_birth_rate, color = continent), size =3, alpha =0.8) +geom_point(aes(y = global_birth_rate, color ="Global Average"), size =3, alpha =0.8) +geom_text(aes(y = avg_birth_rate, label =round(avg_birth_rate, 1), color = continent),vjust =-1.5,size =3,show.legend =FALSE ) +geom_text(aes(y = global_birth_rate, label =round(global_birth_rate, 1)),color ="black",vjust =-1.5,size =3,show.legend =FALSE ) +# add vertilines of important historical eventsgeom_vline(xintercept =c(1990, 2000, 2015), linetype ="dotted", color ="grey", size =0.8) +annotate("text", x =1990, y =40, label ="1990: UN Millennium Goals", color ="grey30", angle =90, hjust =1) +annotate("text", x =2000, y =40, label ="2000: Health Legislation", color ="grey30", angle =90, hjust =1) +annotate("text", x =2015, y =40, label ="2015: SDGs Introduced", color ="grey30", angle =90, hjust =1) +labs(title ="Global and Continental Trends in Birth Rates: 1980 - 2020",subtitle ="Birth rates are measured as live births per 1,000 people.",x ="Year",y ="Birth Rate",color ="Legend" ) +# themetheme_minimal(base_size =14) +theme(plot.title =element_text(hjust =0.5, size =14, face ="bold"),plot.subtitle =element_text(hjust =0.5, size =12),legend.position ="right",legend.text =element_text(size =8),legend.title =element_text(size =10),legend.key.width =unit(0.8, "cm"), legend.key.height =unit(0.4, "cm"),plot.margin =margin(10, 5, 10, 5),panel.grid.major =element_line(color ="#E0E0E0"),panel.grid.minor =element_blank() ) +scale_color_manual(values = continent_colors) +scale_y_continuous(labels = scales::label_number(suffix =" births")) +transition_reveal(year)# displayanimate(animated_plot, nframes =100, fps =90, renderer =gifski_renderer())
The third visualization tracks global and regional trends in birth rates from 1980 to 2020, measured as live births per 1,000 people. Each colored line represents a region, while the black dashed line shows the global average. Key historical policies, such as the UN Millennium Goals (1990), the Health Policy Shift in the US (2000), and the Sustainable Development Goals (2015), are marked on the timeline for context.
The consistent decline in birth rates across all regions reflects broader socioeconomic shifts, including increased access to education, economic development, and outbreak of wars. For example, the Great Recession (2008) corresponds to sharper declines in birth rates in developed regions, demonstrating the impact of economic instability on fertility. Additionally, family planning policies, such as China’s One-Child Policy (1979–2015), illustrate how governance shapes demographic patterns. The convergence of global birth rates highlights a narrowing fertility gap that probably driven by urbanization and women’s labor force participation, though South Asia and Sub-Saharan Africa continue to maintain the highest fertility rates today.